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I. INTRODUCTION 
A. GENERAL 

Determining the relative motion between an observer and his environment is a 
major problem in computer vision. Its applications include mobile robot navigation and 
the monitoring of dynamic processes. Motion estimation also has many applications in 
image processing, such as coding, restoration, and reducing the noise by temporal 
filtering. | 

In computer vision, motion in images is recovered from a time ordered sequence 
of images. The relative motion between the objects in a scene and a camera, gives rise 
to the apparent motion of the objects in a sequence of images. This motion may be 
characterized by observing the apparent motion of a discrete set of features or brightness 
patterns in the images. The objective of motion analysis is the derivation of the motion 
of the objects in the scene through the analysis of motion features or brightness patterns 
associated with objects in the sequence of images. 

A closely related subject to motion estimation is the estimation of Structure of the 
imaged scene. Although structure may be computed independent of motion via stereo 
vision, knowledge of the object motion can facilitate establishment of feature 
correspondences within a pair of stereo images, thus aiding the determination of 
Structure. Indeed, psychological researchers have shown that apparent motion is a clue 
used by the human visual system for computing the scene structure [Ref. 1]. This close 
relationship between the estimation of structure and the estimation of motion has 


combined these two problems. 


In the following sections, the fundamental principles of several approaches for the 
estimation of motion and structure will be discussed. 
B. METHODOLOGIES FOR MOTION ESTIMATION 

Two distinct approaches have been developed for the computation of motion: 

1. Methods Based on the Use of Image Position 

This method is based on extracting a set of relatively sparse, but highly 
discriminatory, two-dimensional features in the images corresponding to three- 
dimensional object features, such as corners or occluding boundaries of surfaces. Such 
points, lines and/or curves are extracted from each frame and then inter-frame 
correspondence is established between these features. The observed displacement of the 
2D image features are used to solve the resulting motion equations. Constraints are 
formulated based on a rigid body motion assumption. This method assumes that 
correspondence is available between features extracted from one image in a sequence of 
images and those extracted from the next image. Establishing and maintaining such 
correspondence is a very hard problem. The ambiguity is produced by the effects of 
occlusion and noise which cause features to appear or disappear and also give rise to 
false features. Some of the techniques developed for solving the corresponding problem 
for stereo vision and optical flow algorithms may be applied to this problem [Ref. 1]. 
Analysis algorithms that rely on feature correspondence are termed feature- 

based algorithms. The feature-based algorithms will be explained in Chapter III in more 


detail. First projection methods, then the motion analysis algorithms [Ref. 2], [Ref. 3] 


based on these projection methods will be explained. An analysis of motion parameter 
error caused by correspondence errors will also be presented in chapter III. 
2. Methods Based On Optical Flow 
Optical flow is the two-dimensional field of instantaneous velocities of 
brightness values (gray levels) in the image plane. The optical flow techniques mainly 
rely on local spatial and temporal derivatives of image brightness values. There are 
several methods which have been proposed for computing optical flow of time-varying 
images. These methods are mainly based on: 
a. Local features 
The first step of this method is the detection of local features in each 
image. After obtaining the features a pair-wise matching between corresponding features 
in two frames must be obtained that minimizes an appropriate cost function. 
b. Spatitemporal gradient 
This method uses the first order spatial and temporal differentials of time 
varying images to estimate at each image point the component of motion in the direction 


of maximally increasing gray-scale intensity. This method is based on the equation: 


oP aor or (1.1) 
t 


where f is the image function, t is time, 6 is the partial derivative operator, u and v are 
the x and y components of the optical velocity [Ref. 4]. 
Equation 1.i alone is not sufficient to determine the optical velocities. 


Some constraints must be imposed, for example: 


1. Optical flow is smooth. 
2. Optical flow is constant and continuous over entire segments in image. 
3. Motion is restricted (e.g., planar motion). 
Using these constraints, some solutions are proposed [Ref. 5], [Ref. 6]. 
c. Fourier Phase Approach 

This method uses the shift property of the Fourier transform, and is the 
most appropriate method for determining the motion of a single object moving across a 
uniform background [Ref. 4]. This method is restrictive because only a single non- 
localized velocity vector is obtained for each image frame. 

The computation of optical flow requires the evaluation of first and 
second partial derivatives of image brightness values and also of the optical flow. The 
real images are noisy, in general. The evaluation of derivatives is a noise enhancing 
operation. As we increase the order of the derivative, we increase the sensitivity of 
noise. Also because of occlusion there may be some discontinuities in the optical flow, 
these regions must be detected reliably, otherwise the continuity assumption is violated. 

A new Fourier phase approach proposed by Burl [Ref. 7] based on the 
extended Kalman filter is discussed in Chapter II. The extended Kalman filter is 
implemented in spatial frequency domain and provides an estimate of the object velocity. 
Accumulative difference and binary image approaches are discussed in Chapter IV. 
Chapter V compares the performance of the algorithms for low signal to noise ratio 
(SNR) images. A discussion about the results of the computer simulations will also be 


presented at that chapter. 


Il. EXTENDED KALMAN FILTER ALGORITHM 

A. GENERAL 

An extended Kalman filter (EKF) is used to estimate the velocity of an object 
moving across an image frame and to reduce the effect of noise. The algorithm is 
proposed by Burl [Ref. 7] and the image sequences are modeled by dynamic nonlinear 
state equations. A simple dynamic model consists of a shift operator with the shift given 
by the velocity times the sample time. This model is given in both the spatial and spatial 
frequency domains. Application of EKF in the spatial domain requires a prohibitive 
number of calculations but implementation of the EKF in the spatial frequency domain 
reduces the number of calculations by a great amount. The resulting algorithm, referred 
to as the parallel extended Kalman filter (PEKF), is developed in detail and a number of 
practical considerations are presented in the Reference 7. As the velocity estimation 
error approaches zero, the EKF is shown to converge to a parallel set of third-order 
extended Kalman filters. This structure is referred to as the modified extended Kalman 
filter (MEKF). The single frequency EKFs are combined to yield an estimate of the 
velocity of the moving object using weighted least square algorithm. An ambiguity of 
multiplies of 27 in the frequency-velocity estimates is addressed in Reference 7. A 
proposed a two-step algorithm to solve the ambiguity utilizing some structures of the 
problem is also given in Reference 7. First, the object velocity is estimated using the 
subset of the spatial frequencies where there is no ambiguity, then this velocity estimation 
is used to estimate the ambiguities for other spatial frequencies. Another approaches to 


solve the ambiguity problem will be presented in Section II.C using the properties of 


EKF and the problem structure. An outline of the EKF algorithm is given in Section 
II.B. The simulation results of the proposed algorithm with varying SNR images is given 
in Section II.D. 
B. OUTLINE OF THE EKF ALGORITHM 

In the spatial domain, successive image frames composed of N by N pixels and 


containing a moving object is modeled by a nonlinear state space equation: 


+ n{k) 
xkst)-\ Me Ay ol A Gai 
vk+1)} | 0 Tho} [ne 











where x(k) is the state of the system, f(k) © R®” is composed of amplitudes of each 
pixel at time k stacked into a vector, v € R’ is the velocity vector in pixels per sampling 
time, and S(v) is a two-dimensional shift operator with the magnitude and direction being 
a function of the velocity vector, and n,k) and n,(k) are the image noise and the velocity 
noise, respectively. 


The corresponding measurement equation 1s: 


yA=fA+vA=[1 Ok()+v, (2.2) 


where y © R®™” is the measured image and vy € R™ is a zero mean, white, Gaussian 
random process with a covariance matrix R,,=07I. 

The spatial domain model described by Equations 2.1 and 2.2 is transformed to the 
spatial frequency domain by using the two-dimensional discrete Fourier transform. The 


state equation in the spatial frequency domain is then defined: 
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where F(k) is the fourier transform of f(k), D(v) is the transformed shift operator for a 
specific spatial frequency, and Z,(k) is the transformed image plant noise, Z,(k) is the 
transformed velocity noise. 


The corresponding measurement equation then becomes: 


YK=F()+N(K)=[1 OKX(Q +N, (2.4) 
where Y(k) and N/(k) are the Fourier transform of y(k) and v(k), respectively. 
The EKF uses the spatial frequency domain state equations, Equations 2.3 and 2.4. 
The linearized state equation is given as, 
5X(K+1)=A(X,)6.X(k) +Z(k), (2.5) 


where X, is the current estimate and 6X(k) = X(k) - X5, A(X,) is the Jacobian of the 
nonlinear state equations about X;. 
A summary of EKF structure is given below from References 8 and 9: 


State estimation: 


X(k+1|K)= a(X(k|k),0); (2.6) 
Covariance estimation: 
P(k+1|kK)=A(X(k|K) P(KIDA(X(K|K)) +Q,,3 Ca) 


Innovation: 


e(k|k) = Y(k+1)-Y(k+1|k) = Y(k+1)-CX(k+1|k); (2.8) 


Kalman Gain: 


G(k+1) = P(k+1|K)C71C P(k+1|k) C7 + elie (2.9) 
State correction: 
X(k+1 |k+1)=X(k+1 |k) +G(k+1)[Y(k+1)-Y(k+1 |k)]; (2.10) 


Covariance correction: 
P(k+1|k+1) = [ I-G(k+1)C]P(k+1|b). (2.11) 


The parallel extended Kalman filter structure is given in Figure 2.1. Reference 7 
outlines how the EKF converges to the MEKF as the velocity estimation error approaches 
zero. 


The state vector for a specific spatial frequency is defined: 


X, (| |ReF®) 
X(k)=|X.(0)|=| mF ® |, (242) 
x i3(*) @® Vv 


where X;,,(k) and X;,(k) are the Fourier coefficients of a specific spatial frequency and X;, 


is the frequency-velocity product. 





Figure 2.1 The Modified Extended Kalman Filter. 


C. VELOCITY ESTIMATION 
1. Weighted Least Square 
In Equation 2.12, X,,; is the estimated frequency-velocity product. These 
estimates can be combined to obtain the estimate of the velocity of the moving object. 
This estimation algorithm is complicated by an ambiguity of multiples of 27. The 


frequency-velocity product estimates can be modelled as, 


X,(k|k)=Qv+w+2nm, (2.13) 
where X;(k | kK) is the estimated frequency-velocity product, 9 contains the spatial 


frequencies, and w 1s a zero mean, random vector, with the estimated covariance: 


p> =Elww *|=diag|P, 43Po33,--»Pry233} (2.14) 
The subscript 733 denotes the [3,3] component of the state estimation error covariance 
matrix of the ith single frequency EKF. The vector m models the ambiguity whose 
elements are integer and constrained to be: 
|m.,| S$ LL cae = N, (2.15) 


since the object is constrained to remain within the image. 
The estimated X,(k | k) can be used to estimate the velocity that minimizes 


the weighted sum of the square error: 
J=(X,(k|)-Qv-2nm)7E ~(X,(k|k)-Qv-2nm). (2.16) 
The solution for v can be found by setting the gradient of J with respect to v to zero: 


< =Q7S-'Qv+Q7D~'(2nm)-Q7E"'X,(k|k)=0. (2.17) 


From Equation 2.17: 


v=(Q7E"'Q)"'(Q7E"'X,(k|k)-Q7E~'(27m)). (2.18) 
The search for optimal vector m can be simplified by utilizing some conditions of the 
problem as explained in Reference 7. From these conditions it is found that the ms are 


equal to zero for almost all spatial frequencies with magnitude given: 
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en ee (2.19) 


Wax tl 





This subset of spatial frequencies can be used to estimate the unambiguous velocity. 
Then this velocity can be used to estimate the m,s for spatial frequencies that do not meet 
the condition in Equation 2.19. The solution for vector m can be found by equating the 


gradient of J in Equation 2.16 with respect to m to zero: 


oy 


— =2nm+Qv-X,(k|k)=0. (2.20) 
om 
Using Equation 2.19: 
ae I [y Ts 
mh,=Round(——|X,(k|k)-«9;¥), (2.21) 


where Round(:) equals to nearest integer. Then Equation 2.18 can be used to estimate 
the v vector that minimizes the Equation 2.16. The frequency-velocity product estimates 
are fed back to single frequency EKFs. Once the filter has converged, the values of m 
will all be equal to zero. 
2. Constrained Least Square 

Another approach to estimating the object velocity vector from Equation 2.13 
is the constrained least squares method [Ref. 10]. We can select one of the frequency 
component satisfying Equation 2.19 denoted by C, and the corresponding estimated 
frequency-velocity product denoted by d, as a constraint equation. Then the linear 


constraint equation becomes: 


A 


Cy = d. (2.22) 


The problem now becomes: 


min(X,(k|k) -Qv)7(X,(k|k)-Qv) 
v (2.23) 


Subject to Cv =d. 
From Equation 2.23 we can form the following Lagrangian J: 
J==(K(k|K)-Qv) "K(k |B) -Qv)-A(Cv-d (2.24) 


The solution for v can be found by equating the gradient of J with respect to v to zero: 


AY _aTay-OTK,(k|k)-2C7=0. (2.25) 
av 


The solution for v then becomes: 


v=(Q7Q)" ATX, (k\)+(Q7Q)"C72. (2.26) 


We can write this equation as, 


= TQ\-"7T 22 
Vors=V¥,¢t(Q'Q) CTA, (2.27) 
where V,,; denotes constraint least square solution, and v,, denotes the least square 


solution. We can solve for \ by invoking the constraint defined in Equation 2.22: 


C¥o152Cv 6+ C(O) CT =a. (2.28) 


From Equation 2.28: 
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4=(C(Q7Q)"'C7]} (d-Cv,,). (2.29) 


Substituting Equation 2.29 into 2.27, we obtain the constraint least square solution: 


Vors=¥z6+(Q7Q)'C1C(Q™N)'C| (d-Cv,,). (2.30) 
3. Adding Fictitious Noise 
Since the ambiguity is not modeled in the single frequency EKFs, the system 
may be assumed to have a modeling problem. This wPOblET may cause the true 
estimation errors to become infinite if we don’t process the frequency-velocity estimates 
by one of the methods explained previously. Another way of solving this problem is to 
add fictitious process noise to the system as explained in References 8 and 9. 


The true frequency-velocity estimations are: 


= QV pnp: (2.31) 


3 TRUE 
We can add Qu, where p is a 2 by 1 constant vector, to the frequency-velocity estimates 
at every step. The velocity estimate then consists of the true velocity plus a bias. This 
bias is eliminated easily after the system has converged by using the fact that velocity of 
the object is multiple of integers since the velocity is assumed to be pixel/frame. If the 
elements of vector p is positive and the object velocity is negative then the algorithm will 
estimate the object velocity as true velocity plus N, where N is the size of image since 
the object motion is assumed to be periodic. Using that method eliminates the maximum 


velocity constraint which is imposed from Equation 2.15. 
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D. SIMULATION 

The velocity estimation algorithms are simulated using varying signal to noise ratio 
(SNR) images. A computer generated square object has been moved on a 16 by 16 
image. Artificial noise is added to the image for each experiment. The SNR of each 
image is calculated as signal power over noise power in dB units. Relative translation 
vector error is obtained as the norm of error over the norm of true vector for each 
iteration except the first one, since it is assumed the initial velocity estimate is zero. 
Figure 2.2 shows the result of experiments that have been performed using 6 iterations 
for each SNR and using a positive velocity and a positive » vector for the fictitious noise 
algorithm. The two-step weighted least square algorithm needs more iterations than 6 
to converge to the true velocity for low SNR images [Ref. 11]. The relative errors of 
this algorithm for low SNR are larger but they are smaller for high SNR images. The 
adding fictitious noise algorithm converges to the true velocity faster, since the added 
fictitious noise 1s in the shape of the expected value of true frequency-velocity product 
and the relative error becomes smaller for this algorithm. 

A second experiment is performed to indicate how the EKF algorithm enhances the 
image quality. The SNR of the input image and the SNR of the image estimated (output) 
by EKFs are compared. Figure 2.3 shows that for SNRs less than 10 dB there is an 
improvement in the SNR of the output image and this improvement is constant and about 


7 dB for very low SNR images. 
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Figure 2.2 Comparison of velocity estimation methods. 
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Figure 2.3 SNR improvement of EKF algorithm. 
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It. FEATURE-BASED ALGORITHMS 
A. PROJECTIONS 
1. General 

Image analysis requires an understanding of how the image was formed. 
How a two-dimensional pattern of brightness is produced in an optical image-forming 
system can be studied in two parts: first, we need to find the geometric correspondence 
between points in the scene and points in the image; then we must figure out what 
determines the brightness at a particular point in the image. The geometric 
correspondences are termed projections. 

Projections transform points in a coordinate system of dimension n into points 
in a coordinate system of dimension less than n [Ref. 12]. Here, we will use the 
projection from 3D to 2D. The projection of a 3D object is defined by straight projection 
rays (called projectors) emanating from a center of projection, passing through each point 
of the object, and intersecting a projection plane to form the projection. 

Projections can be divided into two basic classes: perspective and parallel. 
The distinction is in the relation of the center of projection to the projection plane. If the 
distance from the one to the other is finite, then the projection is perspective. If the 
distance is infinite, the projection is parallel. Figure 3.1 shows these two cases. The 
parallel projection is so named because, with the center of projection infinitely distant, 
the projectors are parallel. When defining a perspective projection, we explicitly specify 


its center of projection; for a parallel projection, we give its direction of projection. 


Ae, 


The visual effect of a perspective projection 1s similar to that of photographic 
systems and of the human visual system. The size of the perspective projection of an 
object varies inversely with the distance of that object from the center of the projection 
plane. 


The parallel projection is a less realistic view because perspective 


Center of projection 
at infinity 





Figure 3.1 (a) Line AB and its perspective projection. (b) Line AB and its parallel 
projection. 


foreshortening is lacking. The advantages of parallel projection is that the projection can 


be used for exact measurements and parallel lines remain parallel. 
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2. Perspective Projections 
We can approximate an optical imaging system with an ideal pinhole at a 
fixed distance in front of an image plane. Assume that only light coming through the 
pinhole can reach the image plane. Since light travels along straight lines, each point in 
the image corresponds to a particular direction defined by a ray from that point through 


the pinhole (Figure 3.2). 





Figure 3.2 Perspective Projection 


We define the optical axis in this system to be perpendicular to the line 
from the pinhole to the image plane. We can introduce a cartesian coordinate system with 
the origin at the projection point and the z-axis aligned with the optical axis and pointing 


toward the image. Let V = (X,Y,Z), the vector connecting O to A and V’ = (x,y,f), the 
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vector connecting O to A’, with f the focal length, that is, the distance of the image plane 
from nodal point O; and (x,y) are the coordinates of the point A’ on the image plane in 
the coordinate system with origin at the point of the intersection of the image plane with 
the optical axis, and axes x and y parallel to the axis of the camera coordinate system OX 


and OY. By similar triangles, we can easily write: 


phy ff 3.1) 


Z Z 


These equations relate the image coordinates to the world coordinates of a point. 
Further, to simplify the equations we may assume f = J without loss of generality. 
3. Parallel Projections 

Parallel projections are categorized into two types depending on the relation 
between the direction of projection and the normal to the projection plane. In 
orthographic parallel projections (Figure 3.3), these directions are the same, so the 
direction of the projection is normal to the projection plane. In oblique parallel 
projections, the projection plane is normal but the direction of the projection is different. 
If the direction of projection makes a 45° angle with the projection plane, it is called 
cavalier, if it makes a 63.4° angle, it is called cabinet projection. For our case 
orthographic projection is considered. 

We have a plane that lies parallel to the image plane at Z = Z, in the 
previous perspective projection model. We define the magnification m, as a ratio of the 


distance between two points measured in the image to the distance between the 


20 


corresponding points on the plane. Consider a small interval on the plane (dX,dY,0)’ and 


the corresponding small interval (dx,dy,0)’ in the image. Then: 





Figure 3.3 Orthographic Projection 


where -Z, is the distance of the plane from the pinhole. The magnification is the same 
for all points in the plane. 
A small object at an average distance -Z, will produce an image that is 


magnified by m. The magnification is approximately constant when the depth range of 


21 


the scene is small relative to the average distance of the surfaces from the camera. In 


this case we can simply write for projection equations, that 
x=mX,y=myY. (3.3) 


with m = f/(-Z,) and -Z, is the average value of the depth -Z. Often the scaling factor 
m is set to 1 or -1 for convenience. Then we can further simplify the equations to 


become: 


K=X,. y=! (3.4) 


These equations model the orthographic projection model (Figure 3.3), where the rays 
are parallel to the optical axis or the focal length is infinite. 

The difference between perspective and orthographic projection is small when 
the distance to the scene is much larger than the variation in distance among objects in 
the scene. 

B. MOTION EQUATIONS UNDER PERSPECTIVE PROJECTION 

We can analyze the relation between the image plane motion and the corresponding 
three-dimensional motion for the case of perspective projection. For simplicity f (focal 
length) is assumed to be unity and image plane is assumed to be Stationary. 

A rigid body with coordinates (X,Y,Z)’ moves with a translational velocity V; = 
(U,V, W)’ and rotational velocity 2 = (A,B,C)’. From kinematics the three-dimensional 


velocity of any point on the surface can be written as: 
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oe oO X-Y.Z). (3.5) 


dt’ dt’ dt 


We can write this equation in component form as: 


X=U+BZ-CY; (3.6) 
Y=V+CZ-AZ: (3.7) 
Z=W+AY-BX. (3.8) 


Let (x,y) denote the coordinates of a point in the image plane. As explained before for 


perspective projection the relation between object point P, and the corresponding image 


point p is: 
x5; (3.9) 
yes. (3.10) 


The optical flow at each point in the image plane is the instantaneous velocity of 
the brightness pattern at that point. So the optical flow of the point (x,y) can be denoted 
by (u,v) where: 


u=X, v=y. (3.11) 


Differentiating Equations 3.9 and 3.10 with respect to time and using Equations 3.6, 3.7 


and 3.8 we obtain: 


23 





ya Ba eS rr ang. 
<= +B(1+x2)-Axy-Cy. 
In the same way: 
v= = W eBxy-A(y?+1)+Cx. (3.13) 
We can also write these equations in the form of: 
U=U,+U,, V=V.+V, (3.14) 


where (u,,v,) denotes translational component of the optical flow and (u,,v,) the rotational 


component. 





u= es ; u_=B(1+x*)-Axy-Cy; (3. 15a) 
ye v_=Bxy-A(y?+1)+Cx. (3.15b) 


From Equations 3.12 and 3.13 we can eliminate the unknown depth variable Z. From 


Equation 3.12: 


U-xw 


Zu=U-xW+B(1+x?)Z-AxyZ-CyZ=Z=—_______—___, 
u-B(1+x?)+Axy+Cy 


(3.16) 


and from Equation 3.13: 
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V-yW 


_-——___—-- —__.. (S17) 
v+A(y?+1)-Bxy-Cx 


From these two: 
U-xW _u-B(1 +x") +Axy+Cy_ (3.18) 
V-yW y+A(y?+1)-Bxy-Cx 
Equation 3.18 describes the constraint imposed by the measured value of the optical flow 
(u,v) at any image point (x,y) on the six motion parameters (U, V, W,A,B,C). 

Consider one point P = (X,Y,Z) before the motion with image point (x,y). Suppose 
that the point moves with a general motion, and goes to point P’ = (X’, Y’,Z’) with image 
point (x’,y’). Any three-dimensional rigid body motion is equivalent to a rotation by an 
angle © around an axis through the origin with directional cosines n,, 12, n; followed by 
a translation T=(aX,aY,4Z)’. The relation between coordinates of the point before and 


after the transformation is given by: 


CALE 2 VOOEAL IE (3.19) 


where R is 3 by 3 orthonormal matrix of first kind (i.e., det(R)=1), and it is defined as 


[Ref. 13], 


nt +(1 -n-)cos0 nn,(1-cos8)+n,sin8 n,n,(1-cos®)-n,sin8 
R=|n,n,(1-cos8) —n,sin® n; +(1 -n3)cos0 n,n,(1-cos6) +n,sinO |. (3.20) 


n,n,(1-cos0)+n,sin0 n,n,(1-cos8)-n,sinO n; +(1 -n3)cos0 


For image vectors the transformation becomes: 
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Z'(x'y,1)7=ZR(x,y,1)"+T. (3.21) 


If || 7|| #0, where ||¢|| denotes Euiclidean norm, Equation 3.21 can be written: 


/ A 
rail - RG Ie es Se 
M T (3.22) 
where jet. 
ITI 


We can establish a general relationship between the two sets of image coordinates at this 
point. Using this relationship we can then proceed to solve the motion equations and 
obtain the structure of the scene. 


From Equation 3.19 we can write: 


XWin "12 “3 1X] [AX 
Yl=|75) "22 To3/ YI+/ AY (3228) 
Z'\ 1731 32 T3314) [AZ 


and in component form: 


X'=r, X+rY+r,Z+AX; (3.24a) 
Y'=r,,X +r, Y+r,,Z+AY, (3.24b) 
Z' =F, X+r,,¥+r,,Z+AZ. (3.24c) 


From the perspective projection property, Equations 3.9 and 3.10: 
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x/= 


X! ry XtryVtr,Z+AX (7, ,X%+7 yt, ,)Z+ AX 
Zl ry X +r VrryZ (ry X+Fqgy4ry,)Z+AZ 


(3225) 


As we did before we can eliminate the depth variable Z from these two equations. 


Z! Py X+¥y¥try,Z+ OZ (64) X+0,Ytlg,)Z+ AZ | 


From Equation 3.25: 


AX-x'AZ 


x"(r 3X tT 32) +133) (7X47 Yt, 3) 
From Equation 3.26: 


AY-y’AZ 


ML... ce 
Y(74)X 41 yoy +1733) —(1 aX +1 yyy +193) 


Equating these two equation and solving, we obtain: 


xx'(A Zr, ,-AYr,,)+xy(AXr,,-AZr,,)+x(A Yr, ,-AXr,,)+ 
yx'(AZr,,-A Yr,.)+yy'(AXr4,-AZr,,)+y(A Yr,,-AXr,5)+ 


x'(AZr,,-AYr,,)+y (AXr,,-AZr,,) +(A Yr,,-AXr5,)=0. 


Also from Equation 3.29, it is possible to show this relation as, 


2) | 


(3.26) 


(3.27) 


(3.28) 


(GiZ9) 


[x’ y’ 1]Bly|=0, (3.30) 
l 
where 
Cie omeas 
E=|€, @5 &¢|, (3.30a) 
C,8e, "C5 
with 


e,=AZr,,-AYr,,, €,=AZr,,-AYr,,, €,=AZr,,-AYr,,; 
e,=AXr,,-AZr,,, €;=AXr,,-AZr,,, €¢=AXr,,-AZr,3; (3.31) 


e,=AYr,,-AXr,,, €,=AYr,,-AXr,,, €,=AYr,,-A Xr... 


The elements of E are called essential parameters defined in terms of motion parameters 
[Ref. 14]. Equation 3.31 relates image coordinates of feature points and the elements 
of the matrix E. Since these equations are linear and homogeneous in AX, AY, and AZ, 
the essential parameters can be determined up to a scale factor and the sign of the matrix 
E is arbitrary. We can solve for motion parameters from these essential parameters. 
The relative depth (depth scaled by magnitude of translation) of each point can then be 
determined from the motion parameters and the observed projections of the points. Note 
that if we assume motion is only translational then the R matrix will be the identity and 


the matrix E will become: 
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E-AZ 0 -AX (3.32) 


which is a skew-symmetric matrix. 


In general, it is possible to represent the matrix E in the form of: 
E=ER = [ExR, EXxR, EXxR,| = E, E, E,|, (S955) 


where R=/R, R, R,]. From Equation 3.22 it can be seen that the magnitude of 


translational vector (|| 7'|) and absolute depths of the object points (Z and Z’) can not be 





determined from monocular vision. Equation 3.22 will still hold when ||7]|, Z and Z’ 
are multiplied by any positive constant. Since only the direction of Tis known and E 
can be defined up to a scale factor we can normalize things by assuming ||7|| = J. This 
implies || E||? = 2 which can be proved: 

Z|? = tracel EET } = trace{ ER(ER)'} 


tracet ERR'E! } = tracel EE’ } 


2( AX? + AY’ + AZ’ ) = 2. 
From N point correspondences we can establish a matrix A in order to find the 


essential parameters: 


i / j / / / 1 
Xj%) xii x; yy YN yy x yi 


/ / / / eli 
Az?” XyVo X_ Vr V2 Vy % Yn | (3.34) 


/ / / / / / 
XnXn Xn x, YaXn Yn Ue x, Yn 1} 


be, 


If there is no noise, from Equation 3.29 we can write Ae=0, where e is a nine- 
dimensional vector formed by stacking the elements of the matnx E into a column. With 
noise we will try to find a vector A such that || Ah|| =min, subject to ||h||?=2. 

Since E has 8 degrees of freedom, we need at least 8 point correspondences to 
solve for E. If we use 8 point correspondences and if the rank of A is 8, then the null 
space is of dimension | and h is the vector of normV 2 in the null space. The solution 
is the eigenvector of A7A of norm V2 corresponding to the smallest eigenvalue. Then 
we can produce a solution; first finding the smallest eigenvalue of A’A and then its 
corresponding unit eigenvector h. Then E will be V2 times this h vector. Degenerate 
cases occur when Rank(A) < 8. If 3D points are coplanar then a degenerate matrix A 
results. 


Note that: E, satisfies the following equations: 


AY*+AZ* -AYAX -AZAX 
EE, = -AYAX AZ?+AX? -AYAZ |; (3.35) 
-AZAX -AYAZ AX?+AY? 


and also, 
IIT ID, 10) (3.36) 


Since we estimated E in the previous step, now we can solve the following mean square 


problem in order to find T. 
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Min |E'T\* Subject to |\T\°=1. (3.37) 
The solution is the eigenvector corresponding to the smallest eigenvalue of EE’. We can 


estimate the rotation matrix R, by minimizing ||E - E,R|| with respect to R or we can 


find R directly from the W matrix. Let the matrix W be: 
W-lW, W, W,)=[E,xE,+E,xE, EXE, +E,xE, E,xE,+E,xE,|. (3.38) 
Using the identity, (axb)xc = (a.c)b-(b.c)a, 


E|xE,+E,xE, =(E XR, )XE,+(EXR,,)xX(EXR,,) 
=(E,.E)R,,-(Ry EE, +(E(EXR,3)) Ry -(R,.- (EAR, DE, 
=Riy “(R, EE, +(Ri (Ry xEE, 
=Ry, (Ry) EE, +(R,2tR),).E,E, 
“Ry, (Ry, EE +R EVE, 
=R 


(3.39) 
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This proves that first column of R is E,xE, + E,xE;. Similarly other columns can be 
found. 


Using Equation 3.22 we can find the relative depths: 


V, ~ 2.2) such _ that, 
ITV iT 3.40) 
to"! 1)T-R(xy,1))|-2 = = min, 
IT] |7i 


using standard least square methods. The relative 3D position of any point at time t, and 


t, then becomes: 


6) 


/ 
R(&(ay,1)) + + ly) | 3.41) 


(X/.Y) ZT = 7] IT WZ) 
b 4) > 
OKDrR | xiyZ)-—F (3.42) 


C. ERROR ANALYSIS OF THE ALGORITHM 

The error analysis of the algorithm will be discussed at this section using the 
concepts explained in Reference 2. The feature points used in the algorithm are 
corrupted by noise. The sources of noise are feature detector errors, matching errors, 
quantization errors and system calibration errors. All these errors result in errors in the 
solution for the motion parameters and 3D structure of the scene. The computer 
roundoff errors can be made small enough to be ignored for error analysis. So, the 
primary source of noise is assumed to be the perturbation of image coordinates. 

The errors depend on the motion parameters and the scene structure. First the 
effects on the motion parameters and then the effects on scene structure will be 
discussed. 

1. Motion Parameters 

The error will depend the motion parameters and is discussed at this section. 
The magnitude and the direction of the translation vector and the rotation matrix elements 
will effect the errors that occur in estimating the structure and the motion of the moving 


object. 
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a. Magnitude of Translation 

If the magnitude of translation vector is zero, the estimate of the 
translation direction will be arbitrary and the estimated translation vector will be an 
arbitrary unit vector (since 7, x T = O, where 7, is a unit vector of the same direction 
as the translation). From Equation 3.41 we see that the relative depths of feature points 
can not be determined. When 7=0, the rank of A in Equation 3.34 can not be larger 
than 6 [Ref. 15]. R can still be determined in this case by picking up any h satisfying 
|| Ah || =min, since E is defined as E=E, x R in Equation 3.33. 

b. Direction of Translation 

Figure 3.4 shows the relation between 3D world coordinates and the 
corresponding image coordinates with motion. As mentioned earlier, motion 1s 
represented by a rotation followed by a translation. From Equation 3.30 and 3.33 we 
can write that x’E,Rx = O, where x’ and x denotes image coordinates at ¢, and ¢,, 
respectively. From this relation we can see that 7 is orthogonal to the cross product 
x’xRx. So the translation vector is determined from the relation 7. (x’ x Rx ) = O. 

Figure 3.5 shows the case where the translational vector is orthogonal 
to the image plane. It can be seen from the figure that the vectors x’ x Rx are spread 
over the X-Y plane around the origin. This locus is shown by a shaded area in the 
figure. Figure 3.6 shows the case where the translational vector is parallel to the image 
plane. This time x’ x Rx are confined in a small shaded area in the X-Z plane. The 
perturbations in the image coordinates will cause the product x’ x Rx to be perturbated 


away from the original positions as denoted by dark disks in the figures. As explained 


OS 





Figure 3.4 General Motion Relation 


before, T is determined from T.(x’ x Rx)=0O, so T is orthogonal to n vectors (for n 
correspondences) in the shaded area. Since the shaded area in Figure 3.5 spreads around 
the origin while the shaded area in Figure 3.6 is confined in a small area in one side of 
the origin, Figure 3.5 allows a more reliable estimate of T than Figure 3.6. 

The perturbation of Figure 3.5 will not leave the shaded area often as of 
Figure 3.6. This can be seen as follows. Assume the vector x’ is perturbed in the image 
plane. The perturbation disk is orthogonal to Rx, and almost parallel to the shaded area 
if Rx is near the optical axis. Similarly, the perturbation due to Rx is orthogonal to x’ 
and so it is also nearly paraliel to the shaded area if x’ is near the optical axis. Hence, 


such individual perturbations will not cause large errors in the estimation of T. 
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Figure 3.5 Orthogonal Translation. 


In the case of Figure 3.6, the perturbation disk is nearly orthogonal to 
the shaded area if Rx and x’ are near the optical axis as explained before. Therefore the 
perturbations of x’ and Rx in Figure 3.6 will cause larger errors in the estimation of T 
than those in Figure 3.5. It is easy to see from Figure 3.6 that the largest perturbation 
of the estimated translation direction is in the Z component. 

Both the shape of the shaded areas and the orientation of the perturbation 
disks imply that a translation orthogonal to the image plane allows more stable estimation 


of translation direction 7 than a translation parallel to the image plane. 
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Figure 3.6 Parallel Translation. 


c. Rotation Parameters 

The correlation between the rotation and translation is very complicated. 
A rotation about the optical axis is easy to distinguished from translation since no 
translation will give the same displacement field in the image plane. But, a displacement 
field generated by rotation about an axis parallel to the image plane (x or y axis) may be 
nearly the same as that generated by a translation. The differences between these two 
displacement fields are not very large, especially for short displacement vectors or at the 
center of the images. So, the algorithm may easily confuse translation with rotation in 
the presence of noise. As a result, rotations with a rotation axis parallel to the image 


plane are more sensitive to noise than other rotations. However, since the displacement 
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field in the image plane is mainly caused by translation in most cases, the effects of 
translation are more dominant. The rotation matrix R is determined using the translation 
vector 7 (Equation 3.39). A difficult case for the estimation of translation is therefore 
also a difficult for the estimation of rotation. 

From the above, we can conclude that long displacement vectors will 
generally result in more reliable solutions for rotation parameters than short ones in the 
presence of noise. To yield long displacement vectors, the KBaBe should be large and/or 
scene should be close to image sensor. 

2. Structure of the Scene 

The nine-dimensional unit vector / is determined up to a sign if and only if 
rank of A in Equation 3.34 is equal to 8. A necessary and sufficient condition for the 
rank of A to equal to 8 is given by Longuet-Higgins [Ref. 15]. They define as 
degenerate’ eight-point configurations for which the algorithm fails. A configuration 1s 
degenerate if as many as four of the points lie in a straight line, or if as many as seven 
of them lie ina plane. Degeneracy also arises if the configuration includes six points at 
the vertices of regular hexagon, or consists of eight points at the vertices of a cube. In 
the presence of noise, the rank of A is generally mathematically full even if the actual 
structure is degenerate. If the structure is degenerate the solution is unreliable. So, in 
the presence of noise, we should consider the numerical condition of the matrix A. 

If the projections of feature points are confined to a small portion of the 


images, only a small portion of the image resolution is used. This will result in a less 
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reliable solution. So, the configuration of the feature points should be such that its 
projection covers as much of the image as possible. 

In the discussion of motion parameters, we see that long displacement vectors 
will result in more reliable solutions. For the same amount of motion, the scene should 
be close to the camera so that it yields long displacement field vectors in the image 
plane. This condition is related to the numerical condition of matrix A. 

A very effective way to reduce the error in the solutions is by using more 
points than the minimally required 8. Since a noise-corrupted image vector may result 
in a large amount of error, it is better to use only reliable feature matches for motion 
parameter estimation. 

The resolution will affect the correctness of image coordinates. So, using 
high resolution will decrease the errors in the algorithm. Assuming resolution and focal 
length to be fixed, reducing the image size by a factor, say 2, will be equivalent to 
doubling the distance from the camera to the scene and reduces the variation in depth of 
the image. This is equivalent to reducing the resolution. So, a reduction of image size 
will worsen the performance of the algorithm. 

The following section presents statistical data obtained through simulations 


that demonstrate the comments made in the discussions. 
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3. Experimental Results 
In the simulations, the feature points are generated randomly within a cube 
of 10x 10x 10 units. The object distance is 11 units, the image size is 2 units, and the 
image resolution is stated for each simulation. 
All the errors shown in this section are relative errors. The relative error of 
a matrix or a vector is defined by the Euclidean norm of the error matrix or vector 
divided by the Euclidean norm of the correct matrix or vector. Relative errors are often 
simply referred to as errors, in the following sections. 
a. Simulation for Image Resolution 
Figure 3.7 shows the relation between the errors in the estimates and the 
image resolution. It can be seen that increasing the resolution by a factor 2 roughly 
decreases the errors by a factor 2 which is expected from previous discussion. 
b. Simulation for Number of Corresponding Points 
Figure 3.8 shows the relation between the errors in the estimates and the 
number of corresponding points. It can been seen that an increase in the number of 
corresponding points will decrease the errors as expected. 
c. Simulation for Image Size 
In order to show the effects of decreasing image size, the image size is 
reduced by a factor of 2. To make sure that the same scene is visible, the object is 
moved away from the camera by a factor of 2. The same simulation as in the simulation 
for the number of corresponding points is performed again. From Figure 3.9 it can be 


seen that reducing the image size worsens the estimates which is consistent with earlier 
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Figure 3.7 Simulation for image resolution. 
discussions. 
d. Simulation for Noise 

In order to see the effects of perturbations in x’ (second frame image 
coordinates) artificial image noises are added to the second frame image coordinates. 
The noise is added to each point in the second frame in a circle of radius r and centered 
at that point. This is done by perturbing the second frame image points from 0 percent 
to 10 percent (with 100 levels). To control the orientation, the sign of normally 


distributed random numbers are used. This experiment was performed several hundred 
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EFFECTS OF NUMBER OF CORRESPONDING POINTS 


~--- DEPTH Z 
~x—« TRANSLATION VECTOR T 
-o-0 ROTATION MATRIX R 

IMAGE RESOLUTION : 512 BY 512 


NUMBER OF CORRESPONDING POINTS 





Figure 3.8 Simulation for number of corresponding points. 
times. Averaging the results of these experiments Figure 3.10 is obtained. 
e. Simulation for Magnitude of Translation 
In order to see the effect of translation magnitude, a translation vector 
equal to (t, O, t) is used for simulation. The value of t is changed from 0.5 to 4.5 with 
rotation axis (J, 0, O) and rotation angle 8°. The results of these simulations is given 
in Figure 3.11. We can see that as the magnitude of translation increases, the error 


levels are decreased. 
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Figure 3.9 Simulation for image size. 
f. Simulation for Direction of Translation 
For this simulation the rotation axis is taken as (J, O, O) and the rotation 
angle 8°. The translation vector is taken as (0.5, O, k) where k is changed from 0 to 2 
using 20 evenly spaced values. The results of the simulations are shown in Figure 3.12. 
We can see that as the direction of the translation vector becomes orthogonal to the 
optical axis the error levels are decreased. This result is consistent with the previous 


discussion. 
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Figure 3.10 Simulation for noise. 
D. MOTION EQUATIONS UNDER ORTHOGRAPHIC PROJECTION 


We can analyze the relation between the image plane motion and the corresponding 
three-dimensional motion for the case of orthographic projection [Ref. 3]. The same 
notation and assumptions will be used as were used for the case of perspective projection. 


For general rigid body motion, Equation 3.19 is still valid. From N image point 


correspondences: 
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Figure 3.11 Simulation for magnitude of translation. 


(x9) *(x;,9) i=1,2,.. (3.44) 
The problem is to determine R, T, and (X,,Y,Z,) for N points. For orthographic 


projection from Equation 3.4: 


x=X, y=Y, (3.45) 


From Equation 3.45 it is obvious that 4Z can never be determined and the Z,’s can be 


determined only within an unknown additive constant. For the following discussion, 
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Figure 3.12 Simulation for direction of translation. 
we'll try to determine: R, aX, aY, X,, Y,, and ZZ, for i=1,2,...,N. 
1. Two-view Case 
We assume that two orthographic views at time instants ¢, and ¢, are taken of 
a rigid object moving in the 3D object space with N point correspondences. We’ll show 
that no matter how large N is, the problem has an infinite number of solutions. First, 
we can decompose the motion from f, to ¢, into a rotation R around the point (X,, Y,,Z,) 


followed by a translation: 
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GG 
T,= y,-Y, (3.46) 
Z,-Z, 
Second, to determine R and (X,,Y,,Z-Z,), we can move (X,Y,Z) and (X’,Y’,Z’) to the 


origin: 


Xx, 0 x 
Y,| = |0} = |¥/1. (3.47) 
Z, 0 Zi 


Then all other points are related simply by a rotation from /, to f,: 


x [x, 
y}=RY,| i =2,3,...,N. (3.48) 
Z Z; 


From Equations 3.48 and 3.45 we can write: 


Xe MAYA 32) (3.49) 


Yi = May Xj tT y2VjtT x32; 
In matrix notation: 


/ 


x; 


r 
’ "z, (3.50) 
To3 


To eliminate Z;, premultiply both sides of Equation 3.50 by /r,;,-r,;/ to get: 

















/ 


y!| [For Too 
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x; x; 


ros asl | = [asMuisla Max’ ais" | 


(3.51) 


Using the identities: 
T1793 7911137 3» (3.52) 
Pat 9371 29" 1373p 


which follow from the fact that each row of R is cross product of the other two, we get: 


Patio ar aia), = 0, PE 
which is linear and homogenous in the four unknowns 7,;, 723, 7;;, and r;,. From N point 
correspondences, we get (N-1) such equations. Therefore, if N = 4 and assuming that 
the points are not coplanar, we can solve the set of Equations 3.53 to obtain 7,3, 723, 73), 
and r;, to within a scale factor. In the absence of noise, four point correspondences are 
sufficient to solve Equation 3.53, additional point correspondences are superfluous. The 
important point to notice here is that Equation 3.53 contains all the information we can 
get from the point correspondences. Therefore, no matter how many point 
correspondences we may have, the only thing we can determine about R is the values of 
T13» 123, T3;, and r;, to within a scale factor. Obviously, by changing the scale factor, we 


can get infinite number of solutions of 7,;, 723, 73), 73. Which satisfies: 


a er 3.54 
T13tlo3 = 13,+13. < 1, ( ) 


which follows from the fact that: 
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7a! rae) 2 
Figtlo3 = 13,413, = 1-143. (3.00) 
For any of these infinitely many solutions, we can construct an R. For each solution of 


R, we can use Equation 3.50 to find Z;; for i=2,3,4. Z,’ can be found from: 


7 = P33Xj+T 32) +1 338;- 19.) 
We remark that as a result of the above derivation, we have a way of testing 

whether a set of proposed point correspondences is legitimate. Assume that we have four 

points over two views, then there are 4! = 24 different mapping between the two sets 


of points. For each mapping, we can solve Equation 3.50 to get r,;, 723, 73), 73. to within 


a scale factor. The mapping is permissible if and only if 


Perky = rherk 3.62) 

In summary, no matter how many point correspondences we have, two 
orthographic views result in an uncountable infinite number of solutions for motion 
parameters and the structure of rigid objects. We also have derived a simple test for the 
legitimacy of point correspondences [Ref. 3]. 

2. Four Points Over Three View Case 

We assume that the image plane is stationary and that three orthographic 
views at time instants ¢,, t,, and t,, respectively, are taken of a rigid body moving in the 
3-D object space. We again use the notation explained before, except that coordinates 


at ¢; will be double primed, and the rotation matrix from £ to t; is denoted by: 
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S = |S2; Sy2 $45), (3.63) 


and the rotation from f¢, to ¢; by: 


W = |Wy, Wo. Ws. (3.64) 


Thus, 


W = SR. (3.65) 


Assuming we are given four point correspondences over three views, again we can let: 


1 > mS 0 
Y,| = Ye 7 Y, aay) (3.66) 
Z 0 
i) {Z,) |Z, 
Then 
7 Xe 
Y'| = RY;|, (3.67) 
z! Z; 
and 
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y”| = sy’| (3.68) 
7 7 
x; 
Y’| = WY,| — i=2,3,4. (3.69) 
ae j 


Using the method explained in the previous section, we can determine (7, ;,7>3,73),732), 


(S15,523,537,532) and (W,3,W23,W3),W 32) to within scale factors: 


(7135793731732) = ©(P13:P232P319P39)> 
(S13593253 1532) = B(O 1355310415039) 
(W13,W39W31W32) = ¥(W13,W23,W3),09). 


(3.70) 


where p;, 0;, and w,; are known a, f, and yy are unknown constants which are assumed 


to be nonzero. 


Equation 3.65 is the only constraint we have on r;, 5;, and w, We can 
rewrite it: 
Wii io ee Sip Syn 5301 12 "13 
3.74 
Wo, Woo Waal = [521 So2 532]721 722 1231- ( ) 
W3, W3. Ws; S31 53. 5331731 "32 133 
Multiplying out, we get: 
Wii Wi Sip Syl "12 $13 (3.72) 


= 2 r.. r 
31 32}: 
Wo, Wy S51 So917o1 "29 553 
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w S.. SIF s 
3 
W3 S41 594723 543 


and 


nD (3.74) 
[M31 Wo] = [S31 S32 + S731 139) 
Jay 5D 
13 
7 7 
W,, = [S31 sa | Sear aa" (3.75) 
23 


From Equations 3.73 and 3.74, we can determine a, f, and y. Then, we can find R and 


S. Premultiplying both sides of Equation 3.73 by /s,;,-5,,/, we get: 


w 
[Sy3 543] 








13 
i [$4354 1-54 3521 sa toto (3.76) 


13 
Wo, ly; 


Or, 





mal 3 
Secs =[-s,, Ss Mi?) 
| 23 a, | 32 re 
From Equation 3.70: 
BY (Gy30 13-9 13093) a Ba(-04)P13+93,P 43); (3.78) 


and 
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@ O33 ies 


Y ~939P13+93)P 43 


Similarly, postmultiplying both sides of Equation 3.74 by /r;,,-r3)]’ yields: 


W31P32— W3P 3) 


~931)P23+943P 13 


B 
1 
Thus, a/y and £/y are determined if: 


V y839—To353, #0. 


Next, premultiplying both sides of Equation 3.73 by /S,;,5,;/, we get: 


Maia 
S13 593 7 
23 





; 
13 2 2 

= [815841 752359) el [shes 33? 
23 


Or, 





Wi3 ri3 
2 2 

S13 523 = [533551 —Sy3530] [+053 +523)733- 

W3 93 
From Equation 3.70 we get: 

Dp ie Z 
BOG 13013494303) = -S3,80(03)P13+F32P 23) +B (913 + 923)335 
or, 
aw ieee tey 2 
(930 1,+0,,0.,) = “7 (231P13* FszP2a)S334 (913+ S23)" 3: 


Similarly, postmultiplying both sides of Equation 3.74 by /r3;,73/’ yields: 


39) 


(3.79) 


(3.80) 


(3.81) 


(3.82) 


(3.83) 


(3.84) 


(3.85) 


Ce 
(3)P3)*32P32) = “(0510 13° On2PasV¥as* (031 * Psy (3.86) 


A unique solution for (B/y)r,; and (a/y)s;; can be determined from Equations 3.85 and 


3.86 if: 
2 2 
(S13 +53) — ($557 )3 +839703) wy (3.87) 
— (8357 )3 +5373) (7 i AL i) 


Then, since a/y and {/y have already been obtained, we can determine r,, and s,, 
uniquely. 


From: 


rig *lo3 4733 = ik (ee) 


and from Equation 3.70, we have: 





a7(pi3+P23) = 1-133, (eon) 
or 
1-r5, 
a? = (3.90) 
P13+P 23 


Thus, we have two solutions for a: 
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Goo 





From Equations 3.79 and 3.80, we have two solutions to a, #, and y. Then, we have 
two solutions for 73, To3, 1371 M32. 133° S13. S23» S31, S32. 533, at this moment. For each 
solution, we can determine the remaining elements of R and S§ by the method described 


below. To find r,, and r,,, we may use two properties of the rotation matrix: 


Pai taal 12 = 33713 (3.92) 


Paoli Tal 12 = —T 23° (3.93) 


Equations 3.92 and 3.93 denotes two linear equations with two unknowns (r,, and 7), 


and the coefficient matrix has the determinant: 


31 132 


=e 3 es 30> Bee 


32 ~731 


which is assumed to be nonzero. Therefore we obtain a unique solution for 7,,, Tp. 
Similarly, a unique solution for r,, and r,, can be obtained. Using the same method, s,,, 
Sia; S57) O57 Call Oe LOUNGE 

Using Equation 3.50, the Z,’s can be determined for i=2,3,4. As pointed out 
in [Ref. 10], four point correspondences over three frames yield two solutions for motion 
and structure. The point configurations of these two solutions are reflections of each 


other with respect to the image plane. 
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E. SIMULATION 

Computer generated random numbers are used as feature points when simulating 
the algorithm. The second and the third frame image points are perturbed within a circle 
with radius equal to a varying percent of the original values. This experiment is 
performed several hundred times and the results are plotted on Figure 3.13. 

From Figure 3.13 we can see that the algorithm is very sensitive to the perturbation 
of the feature points. Even 0.1 percent perturbation is enough to obtain very large 
errors. So the algorithm is valid only for ideal cases and can not be used with low SNR 


images. 
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Figure 3.13 Simulation for perturbation. 
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IV. DIFFERENCE IMAGES 

A. GENERAL 

One of the simplest approaches for detecting the changes between two image 
frames taken at different times, 1s to compare these two images on a pixel by pixel basis. 
One way of doing this is to form a difference image (DI). A difference image is a 
binary image generated by comparing the two frames. The DI is generated by placing 
"1" in those pixel positions for which the corresponding pixels in the two frames being 
compared have an appreciable difference in their gray level characteristics. This 


operation is stated as: 


] if Kx, yt) —fix.y,t) [>7, (4 : 1) 


i Boyt) = 
d : 0 Otherwise. 


where /(x,y,t,), f(x,y,t,) are image frames at times #; and ¢, respectively, f,(X,y,t,,t,) 1s the 
difference image and 7 is a threshold. The operation is computationally straightforward 
because it involves only the subtraction of corresponding pixels. This approach is 
applicable only if the illumination is relatively constant within the bounds established by 
threshold. 

Difference images alone reveal little information as to the higher level nature of 
scene and sensor change as reflected in the image plane. For example, the difference 
images will reflect the combination of motion effects in the case of several moving 


objects [Ref. 16]. 


ay 


Difference images are very vulnerable to noise. 1-valued entries in fj in Equation 
4.1 often arise as a result of noise. The removal of noise may be achieved by forming 
4 or 8 connected regions of 1s in f, and then ignoring any region that has less than a 
predetermined number of entries. Unfortunately, this results in ignoring small and/or 
slow-moving objects [Ref. 17]. 

The concepts explained above are illustrated in Figures 4.1 and 4.2. Figure 4.1 


shows an image frames taken at times 4, and ¢, containing a single object of constant 


IMAGE FRAME AT TIME tj 


8 190 
IMAGE FRAME AT TIME ti 





Figure 4.1 Image frames at times t, and t,. 
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intensity denoted by Lls. Figure 4.2 shows the difference image computed using 
Equation 4.1 with a threshold larger than the constant background intensity. Note that 
two disjoint regions R, and R, in the difference image are generated. The region of 
intensity R, which is called the leading edge is generated because of “covering” the 
background and R, which is called the trail edge is generated because of "uncovering" 


the background over time interval ¢-t;, The region between R, and R, 1s zero because of 


DIFFERENCE IMAGE 





Figure 4.2 Difference Image. 


the presence of the object in this region in both images and the object has constant 


59 


intensity. Since the background is assumed to be unchanged in both images the 
background region will also be zero in the difference image. 
B. ACCUMULATIVE DIFFERENCES 

A sequence of images /(x,y,t,), f(x,y,t,), ..., f(x,y,t,) are taken at times ¢,, t,,..., t, 
respectively and let f(x,y,t,) be the reference image. An accumulative difference image 
is formed by comparing the reference with every subsequent image in the sequence. A 
counter for each pixel location is incremented every time there is a difference at that 
pixel location between the reference and an image in the sequence. Thus, when the k”™ 
frame is being compared with the reference, the entry in a given pixel of the 
accumulative image gives the number of times the gray level at that position has been 
different from the corresponding pixel value in the reference image. At each time, 
differences are established by using Equation 4.1 then they are summed up to generate 
the accumulative image. 

These concepts are illustrated in Figure 4.3. A rectangular object has been placed 
at the upper left corner of the picture then moved to the right and down at a constant 
velocity of 1 pixel/frame (column velocity) and 2 pixel/frame (row velocity). Figure 4.3 
shows the corresponding accumulative difference images for the 2nd and 4th frames 
respectively. The velocity of the object can be estimated from the repetition times of the 
homogeneous rows (every element in the row is equal to each other) and homogeneous 
columns (every element in the column is equal to each other) in the accumulative image. 
From Figure 4.3 we see that there are one homogeneous column and two homogeneous 


rows at each time. This shows that the column velocity is 1 and row velocity is 2. 
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ACCUMULATIVE IMAGE AT TIME:t2 





Figure 4.3 Accumulative Difference Images. 


Finding velocity from the accumulative image becomes very hard if noise is present 
in the image since some pixel counters may be incremented because of the noise. This 
effect can be reduced using 4 or 8 connected regions as explained previously. Also, it 
can be seen from Equation 4.1 that selecting the proper threshold value is very important 


for estimating the correct object velocities. This is illustrated in simulation section. 
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The object velocity can also be estimated from the changes in the center of the area 
that the object covers at each picture. This can be accomplished by obtaining binary 
images for each frame. This binary image algorithm is explained in the next section. 
C. BINARY IMAGES AND VELOCITY ESTIMATION 

Binary (two-valued or black-and-white) images are obtained by setting pixel values 
to "0" for all image points corresponding to the background and setting pixel values to 
"1" for all the image points on the object. If the object appears consistently darker (or 
brighter) than the background, it is easy to generate the binary images, since generating 
binary images simply requires thresholding the pixel values. If the object has similar 
gray-level values as the background and/or noise is corrupting the image, the 
thresholding procedure becomes harder. As explained in Reference 5, Histogramming 
or Segmentation methods should be applied in this case to obtain the binary image. 

Assuming the image has been digitized with n rows and m columns, the area of the 


object can be computed in units of the area of a picture cell: 


A= > Py (4.2) 


where p, is the value of binary image at the i™ row and j" column. The x-position of 


center of the area (COA) can be found by using: 
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A= one. (4.3) 
Ata 


where k, is the x-projection of the image, which gives for each column the number of 
picture cells in the column that have the value one. Similarly, the y-position of the COA 


can be found using: 


A,==>. jh, (4.4) 
Aji 


BINARY IMAGE 


10 15 


X=PROJECTION 





Figure 4.4 Binary Image, X and Y Projections. 
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where h, is the y-projection of the image, which gives for each row the number of picture 
cells in the row that have the value one. These concepts are illustrated in Figure 4.4. 
Assuming the object is mgid and each image is generated by using orthographic 


projection, the x-component of the object velocity can be found simply from: 


_ Ay, “A, 
x am > 


t-t 





(4.5) 


where A,,, A,, are the x-components of COAs at time f, and ¢,, respectively. Similarly, 
y-component of the velocity can be found from: 
yon! (4.6) 
ono) 
where A,, and A,, are the y-components of COAs at time ¢, and ¢,, respectively. 

Using the algorithms explained in this chapter, some experiments has been 
performed with real images and computer generated images. The simulations and the 
results are presented in the next section. 

D. SIMULATION 

The concepts explained for accumulative differences are applied to a computer 
generated square object. The object has been moved at a rate of 5 pixels/frame right and 
5 pixels/frame down. The accumulative difference for 8 frames is obtained as explained 
above. The resultant image is shown on Figure 4.5 from which it is possible to estimate 
the object velocity since no noise is added and the object has a regular shape. If the 
image has noise, and/or has irregular shape, it is very hard to estimate the velocity of 


the object. To demonstrate this the car image is used which is shown in Figure 4.6. 
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Figure 4.5 aonet, generated saaiailative aifferenge image. 

The accumulative difference for this image for 3 frames is shown on Figure 4.7. Since 
the background is constant for three frames, it is eliminated except for the regions the 
car covers or uncovers. Velocity estimation is very hard in that case. The only thing 
we can obtain is the type of motion, [Ref. 17]. So, we can say that this algorithm is 
adequate only for an "early warning system" which detects only changes and the 
direction of the motion. 

The concepts explained for binary images are simulated using computer generated 


binary images. A square object has been moved through the image plane and artificial 
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Figure 4.6 Caz rage. 
image noise is added at each frame. To reduce the effect of noise, a thresholding 
process is applied to the frames, then only nonzero four-connected pixel regions are used 
to estimate the motion using the changes in the COA. The relative errors for different 
SNR are obtained. Figure 4.8 shows the result of the first experiment with a threshold 
of 0.5. Figure 4.9 shows the result of another experiment with a threshold 0.25. 

From Figures 4.8 and 4.9, we can see that as we increase the threshold the relative 


error becomes smaller, but we may ignore the small changes and slow moving objects 


which is consistent with the previous discussion. The error levels for high SNR may be 
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Figure 4.7 Accumulative difference car image. 


acceptable for some specific applications, but for low SNR the relative errors are very 


high and this algorithm does not provide a good velocity estimate. 
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Figure 4.8 Simulation result with threshold 0.50. 


68 


DIFFERENCE IMAGE 


fe 
eo) 
sa 
we 
LJ 
z 
= 
Lid 
oO 





Figure 4.9 Simulation result with threshold 0.25 
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V. CONCLUSIONS 

The performance of the EKF algorithm, linear feature-based algorithms, and the 
differencing algorithm are evaluated on low SNR images after outlining the algorithms. 

Chapter II presents the EKF algorithm which is implemented in the spatial 
frequency domain. Implementing the EKF in the spatial frequency domain decreases the 
required computations greatly with respect to a straight forward implementation of the 
EKF [Ref. 7]. Estimating the object velocity from the frequency-velocity product 
necessitates the use of more image frames than the other algorithms discussed in this 
thesis. 

The EKF algorithm assumes constant background and provides a two-dimensional 
nonlocalized velocity vector estimate. Reference 11 summarizes the performance of the 
algorithm using zero background or a checkerboard background, at varying noise levels. 
It is shown that as the noise level increases and/or the object in the image plane does not 
have constant image brightness, such as a pyramid object, the number of iterations 
required to converge to the true velocity increases greatly. 

For low SNR images, the EKF increases the image quality greatly, as shown in the 
simulation part of Chapter II. This technique produces much better velocity estimates 
for low SNR images than the other algorithms. 

Although a detailed overview of the algorithms based on optical flow are not 
presented in this thesis, some of the results of Reference 6 will be outlined for 
comparison purposes. Optical flow algorithms restrict the motion to be smooth and small 


thus requiring a high rate of image acquisition. It also requires that motion vary 
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continuously over the image plane. These two restrictions are effected by object 
occlusion and initial and boundary conditions. Since this algorithm mainly relies on first 
and second order derivatives of image brightness values, the noise is enhanced during 
these operations which results in more sensitivity to noise. This algorithm does not 
provide usable results for low SNR images. 

Feature-based approaches, as discussed in Chapter III, strictly require that 
correspondence be established between image frames. It is the author’s belief that much 
work should be done in this area to improve the performance of these algorithms. It is 
shown in Chapter III that even small relative perturbations of feature points can result 
in large relative errors. Noise and occlusion worsen the establishment of 
correspondences between features and decrease the performance of the algorithms. One 
way of decreasing the sensitivity to noise is to use more than the required number of 
features in least squares technique. This can have a smoothing effect but it may also 
cause additional complications. For example, the computation time is increased for the 
establishment of correspondences. For ideal cases these algorithms produce very 
desirable results. These are the only algorithms compared in this thesis that provide 3 
dimensional velocity and structure estimation. This characteristic is the main advantage 
of these algorithms. 

The accumulative differencing algorithm provides computationally straightforward 
motion estimation. Unfortunately only changes in the image scene and the direction of 
the motion can be estimated with real images. Differencing the images iteratively 


reduces the SNR of images processed by the algorithm which results in poor estimation. 
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Using the changes in the COA of binary images also provides a straightforward motion 
analysis method, but the thresholding and the histrogramming techniques used by the 
algorithm are vulnerable to noise. In Chapter IV, it is shown that those algorithms do 
not provide good motion estimation for low SNR images. 

The preference of the motion estimation algorithm generally depends on the kind 
of application and the expected SNR of images. For high SNR images, feature-based 
algorithms are preferable; they provide three-dimensional information on motion and 
structure. The perspective projection method can provide realistic results which are 
similar to those obtained with the human visual system. If the object distance is large, 
then the variations in the distance to the object points can be ignored and orthographic 
projection is a good approximation to perspective projection. 

Some specific applications such as “early warning systems" may require only the 
detection of changes in the image plane or low level motion estimation. Then, because 
of its computational simplicity, differencing algorithms may be preferable. But, for low 
SNR images, simulations have shown that the EKF provides the best motion and 
structure estimation. The EKF also enhances the image quality at low SNR. So, for low 
SNR images, the EKF algorithm is preferable. For future work, the EKF algorithm can 


be improved to also estimate rotational motions and the depth of the object. 
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